[SPARK-38094] Enable matching schema column names by field ids #35385

jackierwzhang · 2022-02-03T06:08:41Z

What changes were proposed in this pull request?

Field ID is a native field in the Parquet schema (https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L398).

After this PR, when the requested schema has field IDs, Parquet readers will first use the field ID to determine which Parquet columns to read if the field ID exists in Spark schema, before falling back to match using column names.

This PR supports:

Vectorized reader
parquet-mr reader
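The matching order described above can be sketched in plain Scala. This is only an illustration under stated assumptions: `RequestedField`, `FileField`, and `resolve` are hypothetical stand-ins, not Spark's classes, and the sketch assumes that a requested ID with no counterpart in the file yields no match (rather than retrying by name).

```scala
object FieldIdMatching {
  // Hypothetical stand-ins for a Spark schema field and a Parquet file field.
  final case class RequestedField(name: String, id: Option[Int])
  final case class FileField(name: String, id: Option[Int])

  // If the requested field carries an ID, match on the ID only;
  // otherwise fall back to matching on the column name.
  def resolve(requested: RequestedField, file: Seq[FileField]): Option[FileField] =
    requested.id match {
      case Some(id) => file.find(_.id.contains(id))
      case None     => file.find(_.name == requested.name)
    }
}
```

With this shape, a renamed column that keeps its ID still resolves, while a field without an ID behaves exactly as under name matching.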

Why are the changes needed?

It enables matching columns by field ID for supported DWs like Iceberg and Delta. Specifically, it enables easy conversion from Iceberg (which uses field ids by name) to Delta, and allows id mode for Delta column mapping.

Does this PR introduce any user-facing change?

This PR introduces three new configurations:

spark.sql.parquet.fieldId.write.enabled: If enabled, Spark writes the native field IDs stored in a StructField's metadata under parquet.field.id to Parquet files. This configuration defaults to true.

spark.sql.parquet.fieldId.read.enabled: If enabled, Spark attempts to read field IDs from Parquet files and uses them to match columns. This configuration defaults to false, so Spark keeps its existing behavior by default.

spark.sql.parquet.fieldId.read.ignoreMissing: If enabled, Spark reads Parquet files that do not have any field IDs while still attempting to match columns by the IDs in the Spark schema; nulls are returned for Spark columns without a match. This configuration defaults to false, so Spark can alert the user when field ID matching is expected but the Parquet files do not have any IDs.
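As a rough usage sketch, a reader could opt in as follows. The configuration key and the `parquet.field.id` metadata key come from the description above; the object name, the `withId` helper, the local session setup, and the input path taken from `args(0)` are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataType, IntegerType, MetadataBuilder, StringType, StructField, StructType}

object FieldIdReadExample {
  // Hypothetical helper: attach a Parquet field ID to a column via StructField metadata.
  def withId(name: String, dataType: DataType, id: Long): StructField =
    StructField(name, dataType, nullable = true,
      new MetadataBuilder().putLong("parquet.field.id", id).build())

  // Requested schema: every column carries an ID, so matching is by ID.
  val schema: StructType = StructType(Seq(
    withId("id", IntegerType, 1L),
    withId("name", StringType, 2L)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("field-id-demo").getOrCreate()
    // Reading by field ID is off by default, so enable it explicitly.
    spark.conf.set("spark.sql.parquet.fieldId.read.enabled", "true")
    // args(0) is an assumed path to a Parquet dataset written with field IDs.
    spark.read.schema(schema).parquet(args(0)).printSchema()
    spark.stop()
  }
}
```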

How was this patch tested?

Existing tests + new unit tests.

AmplabJenkins · 2022-02-03T09:06:56Z

Can one of the admins verify this patch?

sadikovi

Thanks for opening a PR. I left a few comments and would appreciate it if you could address them.

sadikovi · 2022-02-03T22:01:52Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

+   val PARQUET_FIELD_ID_ENABLED =
+    buildConf("spark.sql.parquet.fieldId.enabled")
+      .doc("Field ID is a native field of the Parquet schema spec. When enabled, Parquet readers" +
+        " will use field IDs (if present) in the requested Spark schema to look up Parquet" +


How does it work when there is a mixture of columns that have field id set and ones that don't?

It would try to match by id if id exists, otherwise, it would fall back to match by name.

My understanding is that the code would use field ids if the flag is enabled; if the flag is disabled, the code would use names instead. My main concern is ambiguity resolution in the schema.

Ah, I meant that even when this flag is enabled, my statement above still applies: the matching is on a best-effort basis.

Disabling this flag will completely avoid reading and writing field ids.

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

...re/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala

...e/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala

...rc/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFieldIdIOSuite.scala

huaxingao · 2022-02-04T08:05:02Z

does not support:

Parquet-mr reader due to lack of field id support (needs a follow up ticket)

Just for my own knowledge: what needs to be done to make parquet-mr support field id?

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

jackierwzhang · 2022-02-05T01:05:24Z

does not support:

Parquet-mr reader due to lack of field id support (needs a follow up ticket)

Just for my own knowledge: what needs to be done to make parquet-mr support field id?

I am still investigating. Previously I thought it required support from parquet-mr, but it now looks like that's not necessary.

I am working on a fix locally, which might be pushed out as part of this PR or another.

huaxingao · 2022-02-07T04:41:04Z

@jackierwzhang
FYI: I am working with @shangxinli on column id resolution in parquet-mr (link), with pretty much the same motivation as yours. The work will probably overlap with yours.
One thing that I just realized is that the field id may not be unique within a schema. For example:

message ParquetSchema {
  required group reqMap (MAP) = 1 {
    repeated group key_value (MAP_KEY_VALUE) {
      required binary key (STRING);
      optional group value (MAP) {
        repeated group key_value (MAP_KEY_VALUE) {
          required binary key (STRING);
          optional group value {
            required binary name (STRING) = 1;
            optional binary age (STRING) = 2;
            optional binary gender (STRING) = 3;
            optional group addedStruct = 4 {
              required binary name (STRING) = 1;
              optional binary age (STRING) = 2;
              optional binary gender (STRING) = 3;
            }
          }
        }
      }
    }
  }
}

I probably need to change the format specification to make the field id unique.

jackierwzhang · 2022-02-07T06:17:56Z

@huaxingao

Got it.

As for duplicated field ids, I think in my approach, reading parquet files with duplicated ids across different groups is allowed; essentially we just don't want confusion when matching fields that are on the same level in the schema.
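The difference between the two uniqueness requirements discussed here can be made concrete with a small sketch. The `Node` type and both checks are hypothetical and only model the idea: one check rejects duplicate IDs among siblings of the same group, the other rejects duplicates anywhere in the schema.

```scala
object FieldIdUniqueness {
  // Hypothetical schema node: a name, an optional field ID, and child fields.
  final case class Node(name: String, id: Option[Int], children: Seq[Node] = Nil)

  // IDs that occur more than once in the given list.
  private def repeated(ids: Seq[Int]): Set[Int] =
    ids.groupBy(identity).collect { case (id, hits) if hits.size > 1 => id }.toSet

  // True if any group has two direct children sharing an ID.
  def hasSiblingDuplicates(node: Node): Boolean =
    repeated(node.children.flatMap(_.id)).nonEmpty ||
      node.children.exists(hasSiblingDuplicates)

  // All IDs in the subtree, used for the schema-wide check.
  def allIds(node: Node): Seq[Int] =
    node.id.toSeq ++ node.children.flatMap(allIds)

  // True if any ID appears twice anywhere in the schema.
  def hasGlobalDuplicates(node: Node): Boolean =
    repeated(allIds(node)).nonEmpty
}
```

A schema shaped like the example above, where `addedStruct` reuses the IDs of its parent's fields, passes the sibling check but fails the schema-wide one.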

Btw just curious, since you have been working on field id resolution for parquet-mr, do you know whether it currently supports reading and writing field ids yet?

huaxingao · 2022-02-07T07:58:38Z

I think in my approach, reading parquet files with duplicated id across different groups are allowed, essentially we just don't want confusion when matching fields which are on the same level in the schema.

Sounds reasonable. I hope I can do the same too, but it seems to me that I need to resolve the column by id only, which requires the id to be unique in the entire schema. This is going to be a breaking change. Not sure if I am allowed to do it or not.

It doesn't seem to me that parquet-mr supports reading and writing field ids yet. The field ids are not in ColumnDescriptor.

jackierwzhang · 2022-02-07T09:14:58Z

@huaxingao Are you suggesting the code that:

aren't really doing anything?

huaxingao · 2022-02-07T17:46:22Z

@jackierwzhang
No, those are set correctly.
What I meant is that the field ids are not really used. Seems only the ColumnPath is used in column index, column resolution, etc. I am thinking of adding field id in ColumnDescriptor and keeping a map between id and ColumnDescriptor, or a map between id and ColumnPath.
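A map between id and column path, as floated in this comment, might look roughly like the following. This is not parquet-mr code: `Node`, `idPaths`, and `index` are hypothetical names, and a path is modelled as a plain `Vector[String]` that includes the root name.

```scala
object FieldIdIndex {
  // Hypothetical schema node: a name, an optional field ID, and child fields.
  final case class Node(name: String, id: Option[Int], children: Seq[Node] = Nil)

  // Collect an (id, path) pair for every node that carries an ID.
  def idPaths(node: Node, prefix: Vector[String] = Vector.empty): Seq[(Int, Vector[String])] = {
    val path = prefix :+ node.name
    node.id.map(_ -> path).toSeq ++ node.children.flatMap(idPaths(_, path))
  }

  // Build the id -> path index. Left lists the IDs that map to more than
  // one path, which is why this approach needs schema-wide unique IDs.
  def index(root: Node): Either[Set[Int], Map[Int, Vector[String]]] = {
    val grouped = idPaths(root).groupBy(_._1)
    val dupes = grouped.collect { case (id, hits) if hits.size > 1 => id }.toSet
    if (dupes.nonEmpty) Left(dupes)
    else Right(grouped.map { case (id, hits) => id -> hits.head._2 })
  }
}
```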

sadikovi

Looks good. I left a few comments and would appreciate it if you could take a look. Thanks!

...rc/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFieldIdIOSuite.scala

...est/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFieldIdSchemaSuite.scala

...e/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala

jackierwzhang · 2022-02-08T04:20:22Z

@jackierwzhang No, those are set correctly. What I meant is that the field ids are not really used. Seems only the ColumnPath is used in column index, column resolution, etc. I am thinking of adding field id in ColumnDescriptor and keeping a map between id and ColumnDescriptor, or a map between id and ColumnPath.

Got it. I was asking because I tested locally and found that parquet-mr can actually save and read field ids via Spark, so I don't have to patch anything for the parquet-mr repo.

Though there are a couple of small problems remaining for id matching on the parquet-mr side, I believe it's possible to extend this PR (or open another) to enable Spark to match by id in that code path; I'm going to do that soon.

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

...e/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala

.../src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala

...c/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala

sadikovi

Looks good. I left a few minor comments, I would appreciate it if you could follow up. Thank you!

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

...e/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala

sadikovi · 2022-02-09T05:09:02Z

sql/core/src/test/scala/org/apache/spark/sql/test/TestSQLContext.scala

-      SQLConf.SHUFFLE_PARTITIONS.key -> "5")
+      SQLConf.SHUFFLE_PARTITIONS.key -> "5",
+      // Enable parquet read field id for tests to ensure correctness
+      SQLConf.PARQUET_FIELD_ID_READ_ENABLED.key -> "true"


Does this mean that we will not test match by name and will always test by field id?

Not really. Again, enabling this flag would only try to match field ids if they exist, but disabling this flag will completely ignore matching using field ids. So if I read with a Spark schema that has no ids at all and turn on this flag, it would be exactly the same as name matching.

I wanted to enable this flag for all tests to detect any regressions in existing test cases, in case this flag is turned on by default in the future.

But they would exist once we start writing field ids for all of the fields, would not they?

Yeah, but it requires the original schema to contain parquet.field.id metadata, which is not present in any of the existing suites, so it should behave exactly like name matching.

Turning this on actually ensures that we didn't introduce any regression for existing code under this mixed matching mode, and detects if this metadata field has been used anywhere.

...core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetWrite.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetUtils.scala

...est/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFieldIdSchemaSuite.scala

sunchao

This looks pretty good to me. Just some cosmetic comments.

...c/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

...re/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala

...e/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala

.../src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala

sunchao · 2022-02-09T19:14:33Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetUtils.scala

+  def hasFieldId(field: StructField): Boolean =
+    field.metadata.contains(FIELD_ID_METADATA_KEY)
+
+  def getFieldId(field: StructField): Int = {


Maybe we can consider combining getFieldId and hasFieldId into a single method:

def getFieldId(field: StructField): Option[Int]

IMHO, this is fine. I see that hasFieldId() is used separately, and the assertion would still have to be implemented somewhere anyway. As long as there is a test for this, it should be good.
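To compare the two shapes under discussion, here is a minimal sketch. `Field` is a hypothetical stand-in for StructField with its metadata simplified to a `Map`, and the body of `getFieldId` (including the assertion and its message) is an assumption, since the diff above shows only the signature.

```scala
object FieldIdAccess {
  val FieldIdKey = "parquet.field.id"

  // Hypothetical stand-in for StructField; metadata simplified to a Map.
  final case class Field(name: String, metadata: Map[String, Long] = Map.empty)

  // Two-method shape: a presence check plus a getter that asserts presence.
  def hasFieldId(field: Field): Boolean = field.metadata.contains(FieldIdKey)

  def getFieldId(field: Field): Int = {
    require(hasFieldId(field), s"The key $FieldIdKey is missing from the metadata of ${field.name}")
    Math.toIntExact(field.metadata(FieldIdKey))
  }

  // Single-method shape suggested in review: absence is encoded in the type.
  def getFieldIdOpt(field: Field): Option[Int] =
    field.metadata.get(FieldIdKey).map(id => Math.toIntExact(id))
}
```

The `Option` variant moves the missing-ID case to each call site, while the two-method variant keeps a cheap boolean check for callers that only need to know whether an ID is present.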

sadikovi · 2022-02-09T22:24:54Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetUtils.scala

+  def hasFieldId(field: StructField): Boolean =
+    field.metadata.contains(FIELD_ID_METADATA_KEY)
+
+  def getFieldId(field: StructField): Int = {


IMHO, this is fine. I see that hasFieldId() is used separately, and the assertion would still have to be implemented somewhere anyway. As long as there is a test for this, it should be good.

...rc/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFieldIdIOSuite.scala

...est/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFieldIdSchemaSuite.scala

...rc/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFieldIdIOSuite.scala

sadikovi · 2022-02-09T22:29:09Z

sql/core/src/test/scala/org/apache/spark/sql/test/TestSQLContext.scala

-      SQLConf.SHUFFLE_PARTITIONS.key -> "5")
+      SQLConf.SHUFFLE_PARTITIONS.key -> "5",
+      // Enable parquet read field id for tests to ensure correctness
+      SQLConf.PARQUET_FIELD_ID_READ_ENABLED.key -> "true"


But they would exist once we start writing field ids for all of the fields, would not they?

...c/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetUtils.scala

sadikovi

Looks good. I think you can remove the WIP label from your PR as it is no longer a work in progress.

Approved pending addressed comments.

cloud-fan · 2022-02-18T15:12:03Z

thanks, merging to master!

### What changes were proposed in this pull request?

Minor follow ups on #35385:

1. Add a nested schema test
2. Fixed an error message.

### Why are the changes needed?

Better observability.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing test

Closes #35700 from jackierwzhang/SPARK-38094-minor.

Authored-by: jackierwzhang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

Enable matching by field ids

8f2f482

github-actions bot added the SQL label Feb 3, 2022

jackierwzhang added 3 commits February 3, 2022 17:19

Enable matching by field ids

2dd384b

Merge branch 'SPARK-38094-field-ids' of https://github.com/jackierwzhang/spark into SPARK-38094-field-ids

dff650c

retrigger test

6d7e769

sadikovi reviewed Feb 3, 2022

Address some comments

83a3184

huaxingao reviewed Feb 4, 2022

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

huaxingao reviewed Feb 4, 2022

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

huaxingao reviewed Feb 4, 2022

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

typos & conf

bf7ddba

sadikovi reviewed Feb 7, 2022

jackierwzhang added 3 commits February 8, 2022 16:18

Address comments

eb21dc5

Enable parquet-mr code path

0495e0b

Address comments

b7f76f7

martin-g reviewed Feb 8, 2022

address comments

4111582

sadikovi reviewed Feb 9, 2022

Address more comments

723dff7

sunchao reviewed Feb 9, 2022

sadikovi reviewed Feb 9, 2022

sadikovi approved these changes Feb 9, 2022

comments & refactoring

53f44d1

jackierwzhang changed the title ~~[WIP][SPARK-38094] Enable matching schema column names by field ids~~ [SPARK-38094] Enable matching schema column names by field ids Feb 10, 2022

jackierwzhang added 2 commits February 10, 2022 19:14

Conf name update

308212c

Remove the grid testing

9c0b239

cloud-fan approved these changes Feb 18, 2022

cloud-fan closed this in b5eae59 Feb 18, 2022

jackierwzhang mentioned this pull request Mar 1, 2022

[SPARK-38094][SQL][FOLLOWUP] Fix exception message and add a test case #35700

Closed

CTTY mentioned this pull request Jul 20, 2022

[HUDI-4186] Support Hudi with Spark 3.3.0 apache/hudi#5943

Merged
